TOPOLOGICAL VOICEPRINTS FOR SPEAKER IDENTIFICATION



[0001] This application claims the benefit of U.S. Provisional
Patent Application No. 60/497,007 entitled "TOPOLOGICAL VOICEPRINTS
FOR SPEAKER IDENTIFICATION" and filed August 20, 2003, the entire
disclosure of which is incorporated herein by reference as part of
the specification of this application.

Background

[0002] This application relates to identification of speakers by
voices.

[0003] Voices of different persons have different voice
characteristics. The differences in voice characteristics of
different persons can be extracted to construct unique
identification tools to distinguish and identify speakers. To a
certain extent, speaker recognition is a process of automatically
recognizing who is speaking on the basis of individual information
obtained from voices or speech signals. In various applications,
speaker recognition may be divided into Speaker Identification and
Speaker Verification. Speaker identification determines which
registered speaker provides a given utterance amongst a set of known
speakers. The given utterance is analyzed and compared to the voice
information of the known speakers to determine whether there is a
match. In speaker verification, an unknown speaker first claims the
identity of a known speaker, and an utterance from the unknown
speaker is obtained and compared against voice information of the
claimed known speaker to determine whether there is a match.
[0004] Speaker recognition technology has many applications. For
example, a speaker's voice may be used to control access to
restricted facilities, devices, computer systems, databases, and
various services, such as telephonic access to banking, database
services, shopping or voice mail, and access to secured equipment
and computer systems. In both speaker identification and
verification, users are required to "enroll" in the speaker
recognition system by providing examples of their speech so that the
system can characterize and analyze users' voice patterns.

[0005] In the field of speaker recognition, various speaker
recognition methods have been developed to use distances between
vectors of voice features, e.g., spectral parameters, to identify
speakers. In such spectral analysis methods, the distances between
extracted voice features and voice templates of known speakers are
computed. Based on statistical or other suitable analysis, if the
computed distances for received voices or utterances are within
predetermined threshold values for a known speaker, then the
received voices or utterances are assigned to that known speaker.

Summary 

[0006] The speaker recognition techniques described in this
application were developed in part based on the recognition of
various technical limitations in various spectral analysis methods
based on computation of distances of spectral parameters. For
example, such spectral analysis methods may not be sufficiently
accurate at least because different utterances of the same speaker
may have somewhat different spectra and the decision is essentially
dependent on a voice spectral database that is used to fit the
appropriate threshold.

[0007] The speaker recognition techniques of this application use
topological features in voices that are computed from each
individual speaker to construct a set of discrete rational numbers,
such as integers, as a biometric characterization for each speaker
and use such rational numbers to identify a speaker or a subject
under examination. Distinctly different from computing distances
between spectral curves obtained from voices of different speakers
in various spectral analysis methods, such topological features
provide a one-to-one correspondence between a subject and a mold or
voiceprint represented by a set of rational numbers. Therefore, a
database of such rational numbers for different known speakers may
be formed for various applications, including speaker identification
and verification. A database of such rational numbers is small
relative to a conventional voice databank for a person used in
various spectral analysis methods. Each voiceprint includes a set
of topological parameters in the form of discrete integers or
rational numbers to distinguish a speaker from other speakers and is
derived from an embedding of spectral functions of the speaker's
voice.

[0008] In one implementation, a method for determining an identity
of a speaker by voice is described. First, a set of topological
indices is extracted from an embedding of spectral functions of a
speaker's voice. Next, a selection of the topological indices is
used as a biometric characterization of the speaker to identify and
verify the speaker from other speakers.

[0009] In another implementation, the topological parameters are
rational numbers such as integers obtained from the relative
rotation rates (rrr). Each subject is assigned a set of rational
numbers that can be reconstructed from brief utterances. A subset
of these numbers does not change from utterance to utterance of the
same speaker, and differs from subject to subject. In this way, a
standard way to describe the voice can be established,
independently of the size of the features of the database. The set
of rational numbers characterizing the voice is robust, and can be
easily coded in various devices, such as magnetic or printed
devices.

[0010] An exemplary method described in this application includes
the following steps. A speech signal from a speaker is recorded and
digitized. Linear prediction coefficients of the discrete signal
are computed. The power spectrum is computed from the linear
prediction coefficients. Next, a three-dimensional periodic orbit
is constructed from the power spectrum, and a second
three-dimensional periodic orbit is also constructed from a power
spectrum of a reference such as a natural reference signal. The
topological information about the periodic orbits of the speech
signal and the natural reference signal is then obtained. A
selective set of topological indices is used to distinguish a
speaker who produces the speech signal from other speakers who have
different topological indices.

[0011] This application also describes speaker recognition systems.
In one example, a speaker recognition system includes a microphone
to receive a voice sample from a speaker, a reader head to read
voice identification data of rational numbers that uniquely
represent a voice of a known speaker from a portable storage device,
and a processing unit. The processing unit is connected to the
microphone and the reader head and is operable to extract
topological information from the voice sample of the speaker to
produce topological discrete numbers from the voice sample. The
processing unit is also operable to compare the discrete numbers of
the known speaker to the topological discrete numbers from the voice
sample to determine whether the speaker is the known speaker.
Because the file size for digital codes of the discrete rational
numbers for speaker recognition is sufficiently small, one or more
voiceprints for one or more speakers can be stored in the portable
storage device that can be carried with a user.
[0012] These and other examples and implementations are described
in greater detail in the attached drawing, the detailed description,
and the claims.

Brief Description of the Drawing
[0013] FIG. 1 shows examples of periodic functions used for the
embedding from a single speaker (solid lines) and a universal
reference (dotted line). These functions are constructed from the
original log |H(f)|^2 using one half of the original period.

[0014] FIG. 2 shows three examples of log |H(f)|^2 using the maximum
entropy approximation for two different speakers over the complete
period of the function. Beyond the second formant, the spectra
naturally cluster in two different groups. The original sound
segments correspond to the Spanish vowel [a] extracted from normal
speech utterances.

[0015] FIG. 3 shows an example of a delay embedding (Δf = 40 Hz) of
the function F(f) computed from one voiced fragment (solid line).

[0016] FIG. 4 shows vowelprints for three male speakers of nearly
the same age, constructed from short vowel segments (~100 ms) of
around 10 utterances taken in different enrollment sessions.

[0017] FIG. 5A shows an example of a voice sample as a function of
time obtained from a speaker via a microphone.

[0018] FIG. 5B shows a power spectrum obtained from the voice
sample in FIG. 5A.

[0019] FIG. 5C illustrates linking of two three-dimensional orbits
1 and 2 in the topological approach to extract rotation numbers from
voice signals.

[0020] FIG. 5D shows relative rotation numbers from the relative
topological relation between an orbit constructed from a voice
sample and a reference orbit from a reference signal.
[0021] FIGS. 6A, 6B, and 6C illustrate an example of the process to
select invariant rotation numbers from multiple rotation matrices
for the same voiced sound of a speaker as the voiceprint for the
speaker.






WO 2005/020208 PCT/US2004/027193

[0022] FIG. 7 shows an example of comparing the voice of an unknown
speaker to a voiceprint of a known speaker in a full match analysis.

[0023] FIG. 8 illustrates a procedure for verifying two candidates 

against three voiceprints of three known speakers. 

[0024] FIG. 9 shows an example of a speaker recognition system. 

[0025] FIG. 10 shows operation of the system in FIG. 9. 

Detailed Description 
[0026] The speaker recognition techniques described here may be
implemented in various forms. In one implementation, for example, a
set of discrete rational numbers (e.g., integers) is extracted from
voice samples of a speaker. A subset of the extracted rational
numbers is present in each utterance of the speaker and does not
vary from utterance to utterance of the speaker under normal speech
conditions and in a low noise environment. This subset is called a
voiceprint, and it is used as a biometric characterization of the
speaker to identify and verify the speaker from other speakers.
[0027] Hence, speaker verification may be achieved with this
biometric characterization by the following steps. First, a voice
sample from a second speaker is analyzed to extract a set of
rational numbers for the second speaker. The set of discrete
rational numbers for the second speaker is compared to the
voiceprint for the speaker without using a threshold value in the
comparison. The second speaker is then verified as the speaker when
there is a perfect match between the set of rational numbers for the
second speaker and the voiceprint for the speaker. If there is not
a match, the second speaker is identified as a person different from
the speaker.

[0028] In an implementation for speaker identification, voiceprints
are extracted from voice samples of different known speakers. Next,
a voice sample from an unknown speaker is analyzed to extract a set
of rational numbers for the unknown speaker, and the set of discrete
rational numbers for the unknown speaker is compared to the
voiceprints of the known speakers to determine whether there is a
match in order to identify whether the unknown speaker is one of the
known speakers.

[0029] Notably, in the above speaker verification and
identification processes, a comparison between different sets of
discrete rational numbers is made to determine a match. There is no
need to determine whether a difference between two spectral features
is within a selected threshold value. This and other features of
the speaker recognition techniques described here are advantageous
over various spectral analysis methods based on computation of
distances of spectral parameters.

[0030] Voice recognition methods are noninvasive identification
methods and thus, in this regard, are superior to other biometric
identification procedures such as retina scanning methods. However,
spectral analysis methods for speaker recognition are not as widely
used as other biometric procedures, including fingerprinting, in
part because of the difficulty of establishing how close is
sufficiently close for a positive identification when comparing
spectral features in different voices. The speaker recognition
techniques described here avoid the uncertainties in using threshold
values to compare spectral features and provide a novel approach to
the extraction of biometric features from speech spectral
information.
[0031] The spectral properties of voices of persons are known to
carry unique traits of the speakers and thus can be used for speaker
recognition. During the production of voiced sounds, a spectrally
rich sound signal produced by the modulation of the airflow by the
vocal folds is filtered by the vocal tract of the speaker. The
resonances of the vocal tract as a passive filter are determined by
ergonomic features of the speaker, and therefore can be used to
identify the speaker. The physics of human voice can be described
in terms of the standard source-filter theory. During the
production of voiced sounds like vowels, the airflow induces
periodic oscillations in the vocal folds. These oscillations
generate time varying pressure fluctuations at the input of a
passive linear filter, the vocal tract. The separation between
source and filter assumes that the feedback into fold oscillations
is negligible, a hypothesis that has been extensively validated for
the normal speech regime by Laje et al. in Phys. Rev. E 64, 05621
(2001). The spectrally rich input pressure presents harmonics of a
fundamental frequency of about 100 Hz. The vocal tract selects some
frequencies out of these harmonics, and in this way, the spectrum of
a voiced sound carries information about the vocal tract that is
unique to each speaker and therefore can be used as a biometric
characterization of the speaker.

[0032] A typical approach in the field of speaker recognition, such
as various spectral analysis methods, is to use feature vectors with
quantities that characterize different subjects, perform
multidimensional clustering and separate the clusters associated
with the different subjects by means of some metric on the feature
vectors. In the framework of the spectral characterization of the
voice, one way to perform an identity validation is to construct a
distance between properties computed from utterances (distortion
measures), such as the integral of the difference between the two
spectra on a log magnitude. Another distortion measure is based
upon the differences between the spectral slopes, e.g., the first
order derivatives of the log power spectra pair with respect to
frequency.
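
By way of illustration only, the two distortion measures mentioned above may be written as follows for two power spectra S_1(f) and S_2(f). The choice of the absolute value as the integrand, the integration band [f_a, f_b], and the threshold symbol \theta are assumptions of this sketch and are not specified in this application.

```latex
% Log-magnitude distortion (assumed L1 form over an assumed band [f_a, f_b])
d_{\log}(S_1, S_2) = \int_{f_a}^{f_b} \left| \log S_1(f) - \log S_2(f) \right| \, df

% Spectral-slope distortion: same form applied to first derivatives in f
d_{\mathrm{slope}}(S_1, S_2) = \int_{f_a}^{f_b}
  \left| \frac{d}{df}\log S_1(f) - \frac{d}{df}\log S_2(f) \right| \, df

% Threshold-based decision used by such methods (\theta is a tuned value)
\text{accept if } d(S_1, S_2) < \theta , \quad \text{reject otherwise}
```

In either form, the decision depends on a tuned threshold \theta, which is the dependence that the topological approach is intended to avoid.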

[0033] Such spectral analysis methods suffer a number of technical
limitations. FIG. 1 shows examples of log power spectra of three
different utterances by the same speaker. The spectra are somewhat
different in the spectral peaks and shapes for different utterances
from the same speaker. Hence, in computing differences between
spectral features, it is inherently difficult and challenging to
measure the distances between curves and decide how much deviation
is acceptable for speaker recognition. For example, the computed
results from such spectral analysis methods are generally scattered
over ranges for different speakers. As such, uncertainties exist
as to where to set the boundary between acceptable values for two
speakers whose ranges are close.

[0034] The speaker recognition techniques described here use an
entirely different approach to extract unique biometric features
from voices and utterances. The above spectral comparison may be
alternatively implemented by means of another set of coefficients
called cepstrum coefficients that are the Fourier amplitudes of the
spectral function. To a degree, this implementation may be
understood as treating the voice spectrum as a "time" series
where the frequency, f, plays the role of time. Under this view,
the present inventors discovered that the techniques used in the
theory of dynamical systems in order to compare two periodic orbits
can be used in the analysis of voiced sound spectra. This approach
to voice information completely avoids the computation of
differences of spectral features. In particular, the inventors
explored the use of topological tools that are designed to capture
the main morphological features of orbits regardless of slight
deformations. Topological analysis of nonlinear dynamical systems
is a well established technical field and the basic principles and
analytical framework are described in detail by Robert Gilmore in
"Topological analysis of chaotic dynamical systems" in Reviews of
Modern Physics, Vol. 70, No. 4, pages 1455-1529 (October, 1998).

[0035] The following sections describe how to characterize spectra
by means of sets of rational numbers by using topological tools
developed in a different field for dynamical systems. Notably,
within a relatively small bank of speakers, there are subsets of
rational numbers that seem to strengthen the speakers' identity
information. These results suggest a new direction in the
identification of subjects by voice: one in which arrangements of
rational numbers define voiceprints that stand on their own, without
any acceptance/rejection thresholds.

[0036] In the analysis of three-dimensional dynamical systems, the
periodic orbits are closed curves that can be characterized by the
way in which they are knotted and linked to each other and to
themselves. See, e.g., Solari and Gilmore in "Relative rotation
rates for driven dynamical systems," Physical Review A 37, pages
3096-3109 (1988), Mindlin et al. in "Classification of strange
attractors by rational numbers," Physical Review Letters, Vol. 64,
pages 2350-2353 (1990), and Mindlin and Gilmore in Physica D 58,
page 229 (1992). For the purpose of applying this analysis to the
problem of speaker identification, the power spectrum of voiced
sounds on a log scale is treated as a periodic string of data, using
techniques commonly applied to the analysis of periodic "time"
series. A three-dimensional orbit can be constructed from this
string of data using a delay embedding.

[0037] FIG. 2 shows examples of log power spectra of three
vocalizations of two speakers. The spectra naturally cluster in two
sets that correspond to the two speakers, respectively. The
topological properties of their embeddings are found to be a
pertinent tool for identity validation.

[0038] The relative rotation rates described in the above cited
publication by Solari and Gilmore are topological invariants
introduced to help in the description of periodically driven
two-dimensional dynamical systems and can be used to extract
biometric information from spectral properties of human voice. The
relative rotation rates can also be constructed for a large class of
autonomous dynamical systems in R^3: those for which a Poincaré
section can be found.

[0039] In order to describe the vocal tract frequency response, the
maximum entropy approximation of the power spectrum for each of the
stored voiced segments is computed. This computation can be
performed by calculating m linear predictor (lp) coefficients for
the voiced segment {y_n}, sampled with a rate of r = 1/Δ:

    y_n = \sum_{k=1}^{m} d_k \, y_{n-k} + x_n ,    (1)

where the lp coefficients d_1, ..., d_m are assumed constant over
the speech segment, and are chosen so that x_n is minimum. These lp
coefficients can be used to estimate the power spectrum as
a rational function with m poles:

    |H(f)|^2 = \frac{a_0}{\left| 1 - \sum_{k=1}^{m} d_k \, e^{2 \pi i k f \Delta} \right|^2} ,    (2)

which is periodic in [-1/(2Δ), 1/(2Δ)], the Nyquist interval. The
spectra of two speakers in FIG. 2 are examples of reconstructed
spectra based on Equation (2).
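
By way of illustration only, the linear prediction step and the all-pole spectral estimate may be sketched as follows. The sketch assumes the autocorrelation (Levinson-Durbin) method for choosing the coefficients d_k and assumes the usual all-pole form a0 / |1 - sum_k d_k z^k|^2 with z = exp(2*pi*i*f*Δ), where a0 is taken to be the mean-square prediction residual. The function names, the test tone, and the sampling rate are hypothetical and are not taken from this application.

```python
import cmath
import math


def autocorrelation(y, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of the sequence y."""
    n = len(y)
    return [sum(y[i] * y[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def linear_prediction(y, m):
    """Levinson-Durbin recursion (autocorrelation method; an assumed choice).

    Returns (d, a0) with y[n] ~ sum_k d[k-1] * y[n-k] for k = 1..m and
    a0 the mean-square prediction residual.
    """
    r = autocorrelation(y, m)
    d, a0 = [], r[0]
    for i in range(m):
        # Reflection coefficient for prediction order i + 1.
        k = (r[i + 1] - sum(d[j] * r[i - j] for j in range(i))) / a0
        d = [d[j] - k * d[i - 1 - j] for j in range(i)] + [k]
        a0 *= 1.0 - k * k
    return d, a0


def max_entropy_spectrum(d, a0, f, delta):
    """All-pole estimate a0 / |1 - sum_k d_k z^k|^2, z = exp(2*pi*i*f*delta)."""
    z = cmath.exp(2j * math.pi * f * delta)
    denom = 1.0 - sum(dk * z ** (k + 1) for k, dk in enumerate(d))
    return a0 / abs(denom) ** 2


# Hypothetical test signal: a 1000 Hz tone sampled at 8000 Hz.
DELTA = 1.0 / 8000.0
signal = [math.cos(2 * math.pi * 1000.0 * n * DELTA) for n in range(400)]
d, a0 = linear_prediction(signal, 2)
# Frequency (on a 10 Hz grid up to the Nyquist frequency) of the largest value.
peak_hz = max(range(0, 4001, 10),
              key=lambda f: max_entropy_spectrum(d, a0, f, DELTA))
```

For a voiced segment, the same functions would be called with a larger order m; the estimate is symmetric in f and repeats with period 1/Δ because it depends on f only through z.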

[0040] The log of the power spectral function, log |H(f)|^2, was
approximated using Equation (2) with m = 13 coefficients. This
spectrum is symmetric with respect to f = 0. Therefore only one half
of each spectrum is relevant to the analysis and extraction of the
topological rational numbers. In processing the original data in
the voice spectra, we washed out the difference between
log |H(0)|^2 and log |H(π/Δ)|^2, adding a linear function and
subtracting the average. The final spectral function F(f) is a
periodic function and has a period that is one half of the original
period.
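
By way of illustration only, the construction of a periodic function from one half of the log spectrum may be sketched as follows. The sketch assumes that the log spectrum is sampled on a uniform grid from f = 0 to the Nyquist frequency inclusive, and that the added linear function is the ramp that equalizes the two end values; the function name and the sample values are hypothetical.

```python
def periodize_half_spectrum(log_spec):
    """Turn samples of log|H|^2 on [0, Nyquist] into one period of F(f).

    A linear ramp is added so that the last sample equals the first,
    the duplicated end point is dropped, and the average is subtracted.
    """
    n = len(log_spec) - 1
    slope = (log_spec[-1] - log_spec[0]) / n
    detrended = [v - slope * i for i, v in enumerate(log_spec)]
    body = detrended[:-1]  # the last sample now duplicates the first one
    mean = sum(body) / len(body)
    return [v - mean for v in body]


# Hypothetical log-spectrum samples (arbitrary units).
F = periodize_half_spectrum([3.0, 2.0, 4.0, 1.0, 5.0])
```

The returned list holds one period of F(f); indices beyond its length wrap around, which is the property used by the delay embedding.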
[0041] Referring back to FIG. 1, a few examples of F(f) for
different utterances of the same speaker are shown along with a
reference spectral function. The resulting function F(f) can be
embedded in the phase space using a delay δ. FIG. 3 further shows
an example of such an orbit using δ = 40 Hz. These delay-embedded
orbits in the phase space defined by F(f), F(f-δ), and F(f-2δ)
always display a hole around the line F(f) = F(f-δ) = F(f-2δ).
Therefore a good Poincaré section is given by the half plane defined
by F(f) = F(f-2δ); F(f-δ) < F(f-2δ).
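
By way of illustration only, the delay embedding and the detection of crossings of the Poincaré half plane may be sketched as follows. The sketch assumes that F is given as one period of uniformly spaced samples and that the delay is expressed as a whole number of samples (the delay in Hz divided by the frequency-grid spacing). The function names and the sinusoidal test function are hypothetical.

```python
import math


def delay_embed(F, s):
    """Closed 3-D orbit (F[i], F[i-s], F[i-2s]) with periodic indexing."""
    n = len(F)
    return [(F[i], F[(i - s) % n], F[(i - 2 * s) % n]) for i in range(n)]


def poincare_crossings(orbit):
    """Indices i where the orbit crosses the half plane x = z, y < z
    between point i and point i + 1 (indices taken modulo the length)."""
    n = len(orbit)
    hits = []
    for i in range(n):
        x0, y0, z0 = orbit[i]
        x1, y1, z1 = orbit[(i + 1) % n]
        # x - z changes sign across the plane x = z.
        if (x0 - z0) * (x1 - z1) < 0 and y0 < z0:
            hits.append(i)
    return hits


# Hypothetical test function: one period of a sine on 101 samples.
F = [math.sin(2 * math.pi * i / 101) for i in range(101)]
orbit = delay_embed(F, 10)
crossings = poincare_crossings(orbit)
```

Consecutive crossings split a closed orbit into the periods (segments) that index the rows and columns of a relative rotation matrix; the single-harmonic test function above yields one such segment.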

[0042] As a topological characterization of these periodic orbits,
the relative rotation with respect to a reference is chosen. As an
example, a universal reference is used: a plain, non-articulated
vocal tract (a zero hypothesis for voiced sounds). This universal
reference is bank-independent and corresponds to the embedding of
the power spectrum of an open-closed uniform tube of a given length
of 17.5 cm for the examples described in this application.
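
By way of illustration only, the resonances of an ideal open-closed uniform tube follow the quarter-wavelength relation below. The lossless-tube idealization and the speed of sound c of approximately 350 m/s are assumptions of this sketch; only the length L = 17.5 cm is taken from this application.

```latex
f_n = \frac{(2n - 1)\, c}{4 L}, \qquad n = 1, 2, 3, \ldots

% With L = 0.175 m and an assumed c \approx 350 m/s:
f_1 \approx \frac{350}{4 \times 0.175} = 500\ \mathrm{Hz}, \qquad
f_2 \approx 1500\ \mathrm{Hz}, \qquad
f_3 \approx 2500\ \mathrm{Hz}
```

Under these assumptions the reference spectrum has evenly spaced peaks at odd multiples of about 500 Hz.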

[0043] The relative rotation of these embedded spectra can be
calculated as follows by assuming that the orbits A and B have
periods p_A and p_B. A relative rotation matrix M ∈ Z^{p_A × p_B}
for the orbits A and B is constructed and the matrix element M_ij
corresponds to summing the signed crossings of the i-th period of
the orbit A relative to the j-th period of the orbit B. The signed
crossings can be calculated by projecting the two orbits A and B
onto a two-dimensional subspace. In this projection, tangent
vectors to the two periods just over the cross are drawn in the
direction of the flow. The upper tangent vector is rotated into the
lower tangent vector, assigning a +1 (-1) to the crossing if the
rotation is right (left) handed. The elements of a relative
rotation matrix constructed as above are rational numbers.
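
By way of illustration only, the summation of signed crossings between two closed curves may be sketched as follows. The sketch assumes that each orbit is given as a closed polygon of (x, y, z) points, that the projection is onto the x-y plane, and that z decides which strand is the upper one. It sums the crossings between two whole curves; a matrix element would restrict the same sum to one period of each orbit. The function name and the two test circles are hypothetical.

```python
import math


def signed_crossing_sum(A, B):
    """Sum of +1/-1 over crossings of closed polygons A and B in the
    x-y projection; +1 when rotating the upper tangent into the lower
    tangent is counterclockwise (right-handed about +z)."""
    total = 0
    na, nb = len(A), len(B)
    for i in range(na):
        a0, a1 = A[i], A[(i + 1) % na]
        rx, ry = a1[0] - a0[0], a1[1] - a0[1]
        for j in range(nb):
            b0, b1 = B[j], B[(j + 1) % nb]
            qx, qy = b1[0] - b0[0], b1[1] - b0[1]
            denom = rx * qy - ry * qx  # 2-D cross product of the tangents
            if denom == 0:
                continue  # parallel projected segments do not cross
            wx, wy = b0[0] - a0[0], b0[1] - a0[1]
            t = (wx * qy - wy * qx) / denom  # position along the A segment
            u = (wx * ry - wy * rx) / denom  # position along the B segment
            if 0 <= t < 1 and 0 <= u < 1:
                za = a0[2] + t * (a1[2] - a0[2])
                zb = b0[2] + u * (b1[2] - b0[2])
                # Cross product taken as (upper tangent) x (lower tangent).
                cross = denom if za > zb else -denom
                total += 1 if cross > 0 else -1
    return total


# Hypothetical test curves: two circles linked once.
A = [(math.cos(2 * math.pi * i / 50), math.sin(2 * math.pi * i / 50), 0.0)
     for i in range(50)]
B = [(1 + math.cos(2 * math.pi * j / 64),
      0.5 * math.sin(2 * math.pi * j / 64),
      math.sin(2 * math.pi * j / 64)) for j in range(64)]
linking_number = signed_crossing_sum(A, B) // 2
```

For two closed curves the sum of signed crossings is twice their linking number, and reversing the direction of one curve flips the sign.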

[0044] This relative rotation matrix is related to the relative
rotation rates through the following equation:
where periodic boundary conditions are used for the matrix.
[0045] In order to construct a voice signature of the speaker, each
of the vowels spoken by the speaker is characterized. One way of
characterizing the vowels is by superposing all the relative
rotation matrices corresponding to the same voiced sound and the
same speaker and by searching for coincidences in these relative
rotation matrices, i.e., the rotation numbers which do not change
when computed from different utterances made by the speaker. These
coincidences are called "robust rotation numbers" and are rational
numbers. Tests were conducted and showed that these robust rotation
integer numbers for one speaker are unique to that speaker and
robust rotation numbers for different speakers are different.
Hence, such robust rotation integer numbers for the speaker are
similar to fingerprints of the speaker and can be used as voice
biometric features for identifying the speaker from others.
[0046] The arrangement of the robust rotation numbers placed in the
original matrix sites is referred to as a "vowelprint" for the
speaker. A collection of vowelprints of a speaker is referred to as
a "voiceprint." FIG. 4 shows three vowelprint examples
corresponding to the Spanish vowel [a] for three male subjects of
nearly the same age.

[0047] A voiceprint as described above is a collection of discrete
rational numbers that represents unique vocal biometric features of
a speaker. A speaker can be recognized by comparing such rational
numbers obtained from the voice of the speaker to a set of rational
numbers obtained from a known speaker. This comparison between two
sets of discrete rational numbers avoids metric computation of
distances between spectral features and the inherent uncertainties
in matching different spectral features based on some predetermined
threshold. In addition, the sizes of digital files for such
rational numbers are relatively small when compared to the usually
large voice data banks for the spectral features in spectral
analysis methods. As a result, the voiceprint of a person may be
stored as digital codes in various portable storage devices, such as
magnetic stripes on credit cards, identification cards (e.g., driver
licenses) and bank cards, bar codes printed on various surfaces such
as printed documents (e.g., passports and driver licenses) and ID
cards, small electronic memory devices, and others. A person can
conveniently carry the voiceprint and use the voiceprint for
identification, verification and other purposes.
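
By way of illustration only, the compactness of such a digital code may be sketched as follows. The sketch assumes that a voiceprint is held as a mapping from a voiced sound to a matrix of integers in which empty sites are marked None, and it assumes JSON as the storage format; neither the layout nor the format is specified in this application, and the values shown are hypothetical. Non-integer rational numbers would need an additional convention, such as numerator and denominator pairs.

```python
import json


def encode_voiceprint(voiceprint):
    """Serialize {sound: matrix} to compact UTF-8 JSON (None becomes null)."""
    text = json.dumps(voiceprint, separators=(",", ":"), sort_keys=True)
    return text.encode("utf-8")


def decode_voiceprint(data):
    """Inverse of encode_voiceprint."""
    return json.loads(data.decode("utf-8"))


# Hypothetical voiceprint with two vowelprints of size 2 x 2.
voiceprint = {"a": [[1, None], [2, None]], "e": [[None, 0], [1, None]]}
payload = encode_voiceprint(voiceprint)
```

A payload of this kind occupies tens of bytes per vowelprint, which is within the capacity of the portable media mentioned above.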
[0048] In implementations, computers or microprocessor-based
electronic devices and systems may be used to receive and process
the voice signals from speakers and extract the rational numbers for
the voiceprints for the speakers. Such voiceprints may be stored
for subsequent speaker identification and verification processes.
For example, a microphone connected to a computer or
microprocessor-based electronic device or system may be used to
obtain voice samples from speakers. The voice signals received by
the microphone are digitized and the digitized voice signals are
then processed using the above described orbits to obtain a set of
robust rotation numbers for each speaker as the voiceprint.

[0049] FIG. 5A shows an example of a voice signal as a function of
time of a speaker that is produced by a microphone. Segments of the
voice signal are selected to form the voice spectra for further
processing. FIG. 5B shows one example of a voice power spectrum
obtained from one segment of the signal in FIG. 5A and a spectrum of
a selected reference voice signal. In actual training of a system,
training utterances are recorded from a group of speakers in
different enrollment sessions.

[0050] FIG. 5C illustrates an example of linking of two simple
three-dimensional orbits 1 and 2. As described above, the knotting
and linking of the two orbits 1 and 2 can be used to obtain relative
rotation indices or numbers. An orbit generated from the speaker's
voice signal as in FIG. 3 and a reference orbit can be used to
obtain the relative rotation matrix based on the relative
topological relations of the two orbits. FIG. 5D shows an example
of the relative rotation integer numbers obtained by the topological
analysis of voice samples. To extract the rational numbers,
periodic functions based on the spectral features of the recorded
voiced sounds are constructed. Closed three-dimensional orbits are
constructed using phase space reconstruction techniques. Following
the analysis of three-dimensional dynamical systems, linking and
knotting properties are extracted from the closed orbits or curves.
The extracted sets of rational numbers (rotation numbers) are
arranged in a matrix form as shown in FIG. 5D. Next, a mold is
formed from the final arrangement of the rotation numbers that
remain invariant for a variety of utterances of each speaker. The
matrix consisting only of the robust numbers placed in the original
matrix sites may be used to constitute the voice signature, or voice
mold, for the speaker.

[0051] FIGS. 6A, 6B, and 6C illustrate the formation of a voice
mold for a particular speaker. The rotation rates of the orbit for
the voice signal F(f) relative to the chosen reference can be
calculated. For a function F(f) whose embedded orbit has p segments
and a reference of q segments, a matrix of p x q rotation numbers
can be obtained. FIG. 6A shows an example of a 4x4 matrix of
rotation numbers. The matrix element (i,j) of this matrix
corresponds to the number of turns of the segment i of the periodic
orbit of the speaker relative to the segment j of the reference.
Each matrix element is a rotation number. A voice mold is computed
as the invariant rotation numbers of all the utterances of the
training set. As an example, FIG. 6B shows 4 different matrices
obtained from the same speaker for the same voiced sound. Some
rotation numbers vary from one matrix to another amongst the 4
obtained matrices. FIG. 6B further shows 4 shaded matrix elements
that do not change in the 4 matrices. Based on the 4 samples in
FIG. 6B, a final matrix for the voice mold is created as shown in
FIG. 6C. The matrix for the voice mold is still a p x q matrix like
the original matrix, except that only the invariant matrix elements
remain and the remaining matrix elements are left empty. These
empty matrix elements correspond to the most varying topological
indexes. There is a mold for every speaker and every voiced sound.
The above training process is repeated for all speakers in order to
establish a voice data bank for molds of all speakers.
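
By way of illustration only, the formation of a mold from several rotation matrices of the same voiced sound may be sketched as follows. The sketch assumes that each matrix is a list of rows and that an empty site of the mold is marked None; the function name and the matrix values are hypothetical and are not those of FIGS. 6A-6C.

```python
def build_mold(matrices):
    """Keep the entries that are identical in every matrix; mark the
    entries that vary between utterances as empty (None)."""
    first = matrices[0]
    return [[value if all(m[i][j] == value for m in matrices) else None
             for j, value in enumerate(row)]
            for i, row in enumerate(first)]


# Hypothetical 2 x 2 rotation matrices from three utterances of one sound.
utterances = [
    [[1, 0], [2, 1]],
    [[1, 1], [2, 0]],
    [[1, 0], [2, 2]],
]
mold = build_mold(utterances)
```

In this example the first column is invariant across the three utterances and is kept, while the second column varies and is left empty.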

[0052] After the data bank of voice molds for the known speakers is
established and is stored or made accessible by a speaker
recognition system, the system is ready to verify or identify a
speaker. First, a voice sample from an unknown speaker who claims
to be enrolled in the data bank is obtained, and a set of rotation
rate matrices is computed from the voice sample. These test
matrices are compared with the corresponding voice mold for each
voiced sound. The unknown speaker is verified only if the test
matrix fully matches one of the voice molds in the data bank (mold
matching). As long as the full-matching criterion is used, no
acceptance or rejection threshold is needed.

[0053] FIG. 7 shows an example of a voice mold for a speaker on the
left (e.g., codes stored in a credit card) and a test matrix
obtained from an unknown speaker on the right. Out of 6 invariant
rotation numbers in the voice mold on the left, the rotation numbers
in the matrix on the right only have 3 matches. Therefore, there is
no full match in this example and the unknown speaker is determined
not to be the known speaker.
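
By way of illustration only, the full-match comparison of a test matrix against a voice mold may be sketched as follows. The sketch assumes that empty mold sites are marked None and are ignored in the comparison. The function names and the matrix values are hypothetical; they are chosen only to reproduce the counts of the example above (six robust numbers, three matches) and are not the values of FIG. 7.

```python
def count_matches(mold, test):
    """Return (matched, required) over the non-empty sites of the mold."""
    pairs = [(m, t)
             for mold_row, test_row in zip(mold, test)
             for m, t in zip(mold_row, test_row)
             if m is not None]
    return sum(1 for m, t in pairs if m == t), len(pairs)


def is_full_match(mold, test):
    """True only when every robust number of the mold is reproduced."""
    matched, required = count_matches(mold, test)
    return matched == required


# Hypothetical 3 x 3 mold with six robust numbers (None marks an empty site).
mold = [[1, None, 0],
        [None, 2, 1],
        [0, 1, None]]
candidate = [[1, 3, 1],
             [0, 2, 0],
             [0, 2, 2]]
matched, required = count_matches(mold, candidate)
```

The decision is a strict equality test on integers, so no tolerance or threshold parameter appears anywhere in the comparison.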

[0054] The above topological approach to speaker recognition was
successfully tested. A voice bank was constructed by recording six
repetitions of a sentence containing five Spanish vowels for each
one of 18 speakers, and constructing topological matrices from short
fragments (~100 ms) taken from those vowels. The final voice bank
had the voiceprints computed from the topological matrices for each
of the 18 speakers.

[0055] Next, a voice sample from a speaker who claimed to be in the
bank was recorded and topological matrices were computed from the
recorded voice sample. These candidate matrices were compared with
the corresponding vowelprints in the bank. The speaker was
identified as a member of the bank only if the set of candidate
matrices fully matched a single stored voiceprint. In this context,
full matching means that all the robust numbers in all the
vowelprints are present in the corresponding candidate matrices.
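A possible sketch of this identification step follows. It assumes a
voiceprint is a mapping from vowel to mold and that a candidate is
accepted only when exactly one stored voiceprint is fully matched.
Returning None when zero or several voiceprints match is an
interpretation of "a single stored voiceprint". All names and values
are hypothetical.

```python
from fractions import Fraction as F

EMPTY = None  # assumed marker for positions without a robust number


def mold_matches(mold, matrix):
    """True if every robust number in the mold appears in the matrix."""
    return all(
        expected is EMPTY or observed == expected
        for mold_row, row in zip(mold, matrix)
        for expected, observed in zip(mold_row, row)
    )


def identify(bank, candidate):
    """Return the one speaker whose vowelprints are all matched, else None."""
    hits = [
        speaker
        for speaker, vowelprints in bank.items()
        if all(
            vowel in candidate and mold_matches(mold, candidate[vowel])
            for vowel, mold in vowelprints.items()
        )
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical bank: two speakers, two vowels, 1x2 molds.
bank = {
    "speaker_a": {"a": [[F(1, 2), EMPTY]], "e": [[EMPTY, F(1, 3)]]},
    "speaker_b": {"a": [[F(2, 3), EMPTY]], "e": [[EMPTY, F(1, 4)]]},
}
member = {"a": [[F(1, 2), F(3, 7)]], "e": [[F(5, 6), F(1, 3)]]}
outsider = {"a": [[F(1, 2), F(3, 7)]], "e": [[F(5, 6), F(1, 4)]]}

member_result = identify(bank, member)
outsider_result = identify(bank, outsider)
```

The second candidate reproduces one vowelprint of each speaker but not
a complete voiceprint of either, so it is not identified.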
[0056] FIG. 8 shows an example of this comparison for a single
vowelprint obtained from the 18 speakers. In FIG. 8, two candidates
were compared with the bank of molds. For each of the two
candidates, a single vowelprint is shown. A speaker is identified
as a member of the bank if the set of the speaker's candidate
matrices fully matches a single stored voiceprint. The grey areas
in the molds correspond to positions in the matrices that contain
robust numbers. Identification of a candidate as a member of the
bank (i.e., full matching) requires the numbers in those positions
of the candidate's matrix to be equal to the robust numbers in the
mold. Each of the 108 utterances of the voice bank was used as a
candidate for identification. The tests obtained perfect
recognition performance without a single false positive or negative
identification.

[0057] The above choice of a subset of the rotation numbers in the
construction of a voiceprint may suggest that some information can
be lost. In order to test this hypothesis, each voiceprint in the
bank was replaced with the collection of the complete individual
matrices used to construct it, in such a way that all the
topological information is kept. Each of the 108 utterances of our
bank was used as a candidate for identification. Evaluation was
made for the number of coincidences between the candidate matrices
and the set of matrices characterizing each speaker in the bank.
The result was a lower-performing method, since several false
positives and negatives were found. Therefore, the topological
robust numbers seem to strengthen the relevant spectral information,
discarding the unnecessary information carried by the indexes that
vary the most from one utterance to the next.
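The coincidence-counting alternative can be sketched roughly as
follows. The specification does not state how coincidences were
aggregated or turned into a decision, so summing element-wise
coincidences over a speaker's complete matrices and taking the highest
score is purely an assumption here, as are all names and values.

```python
from fractions import Fraction as F


def count_coincidences(candidate, reference):
    """Number of positions where the two matrices hold the same number."""
    return sum(
        1
        for cand_row, ref_row in zip(candidate, reference)
        for c, r in zip(cand_row, ref_row)
        if c == r
    )


def best_speaker_by_coincidences(full_bank, candidate):
    """full_bank maps each speaker to a list of complete training matrices."""
    scores = {
        speaker: sum(count_coincidences(candidate, m) for m in matrices)
        for speaker, matrices in full_bank.items()
    }
    # Assumed decision rule: the speaker with the most coincidences wins.
    return max(scores, key=scores.get), scores


# Hypothetical bank of complete (unfiltered) 1x2 matrices.
full_bank = {
    "speaker_a": [[[F(1, 2), F(1, 3)]], [[F(1, 2), F(2, 5)]]],
    "speaker_b": [[[F(2, 3), F(1, 3)]], [[F(2, 3), F(1, 3)]]],
}
candidate = [[F(1, 2), F(1, 3)]]
best, scores = best_speaker_by_coincidences(full_bank, candidate)
```

In this toy data the varying second element also coincides with
entries of the other speaker, so highly variable indexes contribute to
every speaker's score. That is one plausible way such a scheme could
produce the false identifications reported above.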

[0058] In addition, a comparison between the above topological
approach and a metric method was made. In the metric method, the
quadratic distance between spectra was calculated and coincidences
were computed below an optimized threshold. In this case, the
voiceprint of each speaker in the bank was replaced by the spectral
functions used to construct the rotation matrices. The performance
of this metric method as a speaker recognizer was worse than that of
the topological method.
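For contrast, a metric comparison of this kind might look like the
sketch below. The specification does not define the quadratic distance
precisely. The sum of squared differences between spectra sampled at
the same frequencies, the function names, the sample spectra, and the
threshold value are all assumptions made for illustration.

```python
def quadratic_distance(spectrum_a, spectrum_b):
    """Sum of squared differences between two equally sampled spectra."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must have the same number of samples")
    return sum((a - b) ** 2 for a, b in zip(spectrum_a, spectrum_b))


def count_metric_coincidences(candidate, stored_spectra, threshold):
    """Count stored spectra whose distance to the candidate is below threshold."""
    return sum(
        1 for s in stored_spectra if quadratic_distance(candidate, s) < threshold
    )


# Hypothetical spectra and threshold.
candidate = [1.0, 0.5, 0.25]
stored = [
    [1.0, 0.5, 0.25],  # distance 0.0
    [1.0, 0.0, 0.25],  # distance 0.25
    [0.0, 0.5, 0.25],  # distance 1.0
]
coincidences = count_metric_coincidences(candidate, stored, threshold=0.5)
```

Unlike exact matching of rational numbers, the outcome here depends on
the chosen threshold, which would have to be tuned for each bank.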

[0059] The present topological approach presents many interesting
advantages over various metric methods. In a metric strategy in
which some distance between spectra is computed, a threshold has to
be defined, and this is a bank-dependent quantity. The use of
topological voiceprints constructed with rational numbers, along
with the full-matching criterion, introduces a novel strategy, which
is bank-independent, with no threshold needed to verify the
acceptance.

[0060] Implementations of the topological approach running on
standard personal computers were conducted and the tests suggest
that the topological processing on PCs is fast. Once an utterance
is recorded, voiced sound segments can easily be extracted. Their
relative rotation matrices can be built using simple cross-counting
algorithms (see, e.g., the cited Gilmore paper) and voiceprints are
then computed by simply counting coincidences over a collection of
small matrices. Once the voice data bank is constructed, the whole
recognition task is the matching of small matrices.

[0061] In the present topological approach, the change in the
number of robust numbers is found to be a function of the training
set size. For training sets larger than 10 vowels, the number of
robust numbers converges to approximately 8. These numbers describe
those relative heights of the peaks of the spectral function of a
voiced sound, with respect to the spectrum of a reference, that do
not change from utterance to utterance. The robust numbers of a
subject in our base were compared with the topological indexes
obtained from an utterance recorded when the subject had a strong
cold and thus had a changed voice. Tests suggested that the
information in the matrix of robust numbers degrades gracefully:
only the indexes associated with the highest frequencies changed,
while a large part of the voiceprint remained unaltered.
[0062] Various systems may employ the present topological voice
recognition method. One simple implementation may use a processing
unit that may be a computer or include a microprocessor for
processing voice signals from a microphone connected to the
processing unit. A storage medium, such as an electronic storage
device, a magnetic storage device (e.g., a hard drive in a PC), or an
optical storage device, may be used to store the topological
voiceprints for known speakers. A user provides a voice sample by
speaking to the microphone. The processing unit first processes the
voice sample from the user to extract the user's topological voice
indices and then compares the user's topological voice indices to
the indices stored in the storage device to search for a match
between the user and one of the known speakers in the database.
[0063] FIG. 9 shows an example of a speaker recognition system that
implements the above topological approach. FIG. 10 shows the
operational flow of the system in FIG. 9. The system includes a
processing unit that may be a computer or include a microprocessor
for processing voice signals based on the topological approach and
comparing the voice mold read from a reader head and a test matrix
constructed from a voice signal. An input microphone is connected
to the processing unit and operates to record voice signals from
speakers. A reader head is also connected to the processing unit
and operates to read stored rational numbers for voice molds for one
or more known speakers on a portable storage device such as a
magnetic card, an optical storage device, a card printed
with a bar code encoded with the rational numbers, or an electronic
storage device or memory card.

[0064] As an example, the reader head is assumed to be a magnetic
reader and the portable storage device is a magnetic card that
stores digital codes for one or more voice molds of a known speaker.
A card holder who claims to be the known speaker is asked to slide
the card through the reader and to speak to the microphone so that
his voice samples can be obtained. The processing unit processes
the voice samples to extract the topological rational numbers and
compares them to the rational numbers read from the card. When there
is a full match between all rational numbers, the card user is
verified as the known speaker whose voiceprint is stored on the
card. Access to, e.g., a bank account or a computer system, can then
be granted to the card user.

[0065] Computer security verification systems based on the present
topological approach may be implemented via computer networks where
the digitized voice samples from a user may be sent through a
network to reach a processing unit that determines whether the
user's voice matches a known speaker's voice stored in the
topological data bank. Such systems may be applied to the
Internet, telephone lines and networks, and wireless communication
links such as wireless phone networks and wireless data networks.
Various applications may incorporate the present topological voice
recognition as part of or as the entire verification process, such as
electronic banking or finance, on-line shopping, verification of
various identification documents like passports and ID cards, and
verification of user identity in bank cards, credit cards,
electronic trading, telephone access, keyless entry (cars, homes,
offices, etc.) and driver's licenses.

[0066] Only a few implementations are described. However, it is
understood that variations and enhancements may be made.